TRACE THEORY AND SYSTOLIC COMPUTATIONS


Martin  Rem 

Dept. of Mathematics and Computing Science
Eindhoven University of Technology
P.O. Box 513, 5600 MB Eindhoven, Netherlands


0.  Introduction 

We  discuss  a  class  of  concurrent  computations,  or  special-purpose  computing  engines, 
that  may  be  characterized  by 

(i)  they  consist  of  regular  arrangements  of  simple  cells; 

(ii)  the  arrangement  consumes  streams  of  input  values  and  produces  streams  of  output 
values; 

(iii)  the  cells  communicate  with  a  fixed  number  of  neighbor  cells  only; 

(iv) the communication behaviors of the cells are independent of the values communicated.

Such arrangements are often referred to as systolic arrays [5]. Our computations,
however,  have  a  few  other  characteristics  that  are  usually  not  found  among  systolic 
arrays: 

(v)  synchronization  of  cells  is  by  message  passing  only; 

(vi)  each  output  value  is  produced  as  soon  as  all  input  values  on  which  it  depends  have 
been  consumed. 

The  formalism  we  use  to  discuss  these  computations  is  trace  theory  [4],  [7],  [8].  Section 
1  is  an  introduction  to  trace  theory,  in  which  only  those  subjects  are  covered  that  are  needed 
to understand the subsequent sections. Section 2, called Data Independence, addresses
the question of what it means for communication behaviors to be independent of the values
communicated. To express the simplicity of (the communication behaviors of) the cells
we  define  in  Section  3  the  concept  of  conservative  processes.  The  results  of  Sections  2  and 
3  are  assembled  into  a  number  of  theorems  that  are  used  in  Sections  4,  5,  and  6.  Each 
of  these  remaining  sections  discusses  an  illustrative  example  of  a  systolic  computation: 
polynomial  multiplication,  cyclic  encoding,  and  palindrome  recognition. 


1.  Processes 

This  section  is  a  trace-theoretic  introduction  to  processes.  A  process  is  an  abstraction 
of a mechanism, capturing the ways in which the mechanism can interact with its environment.
A process is characterized by the set of events it can be involved in and by the
possible orders in which these events can occur. Events are represented by symbols. Sets of
symbols are called alphabets and finite-length sequences of symbols are called traces. The
set of all traces with symbols from alphabet A is denoted by A*.

A process is a pair (A, X), where A is an alphabet and X is a non-empty prefix-closed
set of traces with symbols from A:

X ⊆ A*

X ≠ ∅

X = pref(X)

where pref(X) denotes set X extended with all the prefixes of traces in X:

pref(X) = {t ∈ A* | (∃u : u ∈ A* : tu ∈ X)}

For process T we let aT denote its alphabet and tT its set of traces: T = (aT, tT).

An example of a process is

({a, b}, {ε, a, ab, aba, abab, ...})

(ε denotes the empty trace.) We call this process SEM1(a, b). Its trace set consists of all
finite alternations of a and b that do not start with b:

SEM1(a, b) = ({a, b}, pref({ab}*))

where, for X a set of traces, X* denotes the set of all finite concatenations of traces in X.

The central operators of trace theory are projection and weaving. They are the formal
counterparts of abstraction and composition respectively. The projection of trace t on
alphabet A, denoted by t~A, is obtained by removing from t all symbols that are not in
A. We may write t~a for t~{a}. We extend the definition of projection from traces to
processes as follows:

T~A = (aT ∩ A, {t | (∃u : u ∈ tT : u~A = t)})

For example,

SEM1(a, b)~a = ({a}, {a}*)

The weave of processes T and U, denoted by T w U, is defined by

T w U = (aT ∪ aU, {t ∈ (aT ∪ aU)* | t~aT ∈ tT ∧ t~aU ∈ tU})

For example,

SEM1(a, b) w SEM1(b, a) = ({a, b}, {ε})

and

SEM1(a, b) w SEM1(a, c) = ({a, b, c}, pref({abc, acb}*))
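The projection and weave examples above can be checked mechanically on finite approximations. The following is a minimal sketch, not part of the original formalism: it assumes one-character symbols, represents traces as Python strings, and truncates the infinite trace sets at a fixed length. The helper names (pref, sem1, project, weave) are illustrative choices.

```python
from itertools import product


def pref(traces):
    """Prefix closure of a set of traces (strings of one-character symbols)."""
    return {t[:i] for t in traces for i in range(len(t) + 1)}


def sem1(a, b, rounds):
    """Finite approximation of SEM1(a, b): all prefixes of (ab)^rounds."""
    return ({a, b}, pref({(a + b) * rounds}))


def project_trace(t, alphabet):
    """t~A: remove from t every symbol that is not in alphabet."""
    return "".join(s for s in t if s in alphabet)


def project(process, alphabet):
    """T~A, computed on a finite trace set."""
    a_t, t_t = process
    return (a_t & set(alphabet), {project_trace(t, alphabet) for t in t_t})


def weave(p, q, max_len):
    """T w U restricted to traces of length at most max_len (brute force)."""
    (a_t, t_t), (a_u, t_u) = p, q
    alphabet = a_t | a_u
    traces = set()
    for n in range(max_len + 1):
        for symbols in product(sorted(alphabet), repeat=n):
            t = "".join(symbols)
            # t belongs to the weave when both projections are traces
            if project_trace(t, a_t) in t_t and project_trace(t, a_u) in t_u:
                traces.add(t)
    return (alphabet, traces)


ab = sem1("a", "b", 3)
ba = sem1("b", "a", 3)
ac = sem1("a", "c", 3)

projected = project(ab, {"a"})       # approximates ({a}, {a}*)
deadlocked = weave(ab, ba, 4)        # only the empty trace survives
interleaved = weave(ab, ac, 6)       # prefixes of {abc, acb}* up to length 6
```

The bound of three rounds is enough here because every trace of SEM1 of length at most 6 is a prefix of (ab)^3, so membership tests are exact up to that length.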

Let t ∈ tT. The successor set of t in T, denoted by S(t, T), is defined by

S(t, T) = {a ∈ aT | ta ∈ tT}

For example,

S(ε, SEM1(a, b)) = {a}

and

S(a, SEM1(a, b) w SEM1(a, c)) = {b, c}

The successor set of t consists of all events the mechanism can be involved in after trace
t has occurred. Projection, however, can cause the mechanism to refuse some, or all, of
the events in the successor set. In the theory of CSP-processes [1],[3] such processes are
called nondeterministic. (I would rather call them 'ill-behaved'.) For example, let

T = ({a, b, x, y}, {ε, x, xa, y, yb})

Then

T~{a, b} = ({a, b}, {ε, a, b})

By projecting on {a, b} symbols x and y have disappeared: they represent internal (non-observable)
events. Although

S(ε, T~{a, b}) = {a, b}

mechanism T~{a, b} may refuse to participate in event a (or b) because internal event y (or
x) has already occurred. These types of refusals do not occur if we project on independent
alphabets. Alphabet A ⊆ aT is called independent when

(∀t : t ∈ tT : S(t, T) ⊆ A ⇒ S(t, T) = S(t~A, T~A))

Alphabet {a, b} in the example above is not independent:

S(y, T) = {b}

but

S(y~{a, b}, T~{a, b}) = S(ε, T~{a, b}) = {a, b}
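Because the example process is finite, the independence condition can be evaluated exhaustively. The sketch below is an illustration under the assumption that symbols are single characters and traces are strings; the function names are not from the paper.

```python
def project_trace(t, alphabet):
    """t~A for a trace given as a string of one-character symbols."""
    return "".join(s for s in t if s in alphabet)


def project(process, alphabet):
    """T~A for a process with a finite trace set."""
    a_t, t_t = process
    return (a_t & alphabet, {project_trace(t, alphabet) for t in t_t})


def successors(t, process):
    """S(t, T): the symbols that may follow trace t in T."""
    alphabet, traces = process
    return {s for s in alphabet if t + s in traces}


def is_independent(alphabet, process):
    """Check S(t,T) <= A  =>  S(t,T) = S(t~A, T~A) for every trace t of T."""
    projected = project(process, alphabet)
    _, traces = process
    for t in traces:
        s = successors(t, process)
        if s <= alphabet and s != successors(project_trace(t, alphabet), projected):
            return False
    return True


# The example process with internal symbols x and y.
T = ({"a", "b", "x", "y"}, {"", "x", "xa", "y", "yb"})

after_y = successors("y", T)
after_empty_projected = successors("", project(T, {"a", "b"}))
ab_independent = is_independent({"a", "b"}, T)
full_independent = is_independent({"a", "b", "x", "y"}, T)
```

The check reproduces the mismatch stated above: after y only b is possible in T, whereas the projected process still offers both a and b.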
Refusals may also be caused by livelock. For example, let

T = ({a, x}, {a, x}*)

Then

T~a = ({a}, {a}*)

If internal event x occurs every time it can be chosen, event a never occurs: it is refused
forever. Alphabet A ⊆ aT is called livelockfree when

(∀t : t ∈ tT : (∃n : n > 0 : (∀u : u ∈ A* ∧ tu ∈ tT : ℓ(u) < n)))

where ℓ(u) denotes the length of trace u. In the example above alphabets {x} and {a} are
not livelockfree.

Both types of refusals are avoided by projecting on transparent alphabets only. Alphabet
A ⊆ aT is called transparent when A is independent and aT\A (the complement
of A within aT) is livelockfree. The importance of transparence is demonstrated by the
following theorem [4].

Theorem 1.0 Let T be a deterministic CSP-process and A ⊆ aT. Then CSP-process
T~A is deterministic if and only if A is transparent.

Consequently,  if  we  project  on  transparent  alphabets  only,  no  refusals  can  occur  and  there 
is  no  need  to  resort  to  CSP-processes. 


2.  Data  Independence 

In this section the events are transmissions of values along channels. Each channel a
has a non-empty set V(a) of values that can be transmitted along it. The alphabets of our
processes consist of pairs (a, n), where n ∈ V(a). We consider processes such as

T0 = ({a, b} × Z, pref({(a, n)(b, 2*n) | n ∈ Z}*))

where Z stands for the set of integer numbers. For this process V(a) = V(b) = Z. Process
T0 may informally be described as one that doubles integer numbers.

T1 = ({a, b} × {0, 1}, pref({(a, 0), (a, 1)(b, 1)}*))

In this case V(a) = V(b) = {0, 1}. This process may be viewed as one that filters out
zeroes and passes on ones. A process that separates zeroes and ones is

T2 = ({a, b, c} × {0, 1}, pref({(a, 0)(b, 0), (a, 1)(c, 1)}*))

A similar process is

T3 = (({a} × {0, 1}) ∪ ({b} × {true, false}), pref({(a, 0)(b, true), (a, 1)(b, false)}*))

A process that may be viewed as one that arbitrarily permutes two values is

T4 = ({a, b, c} × Z, pref(({(a, m)(a, n)(b, m)(c, n) | m ∈ Z ∧ n ∈ Z}
                          ∪ {(a, m)(a, n)(b, n)(c, m) | m ∈ Z ∧ n ∈ Z})*))

If we are interested only in the channels along which the transmissions take place but
not in the values transmitted, we replace each symbol (a, n) by its channel: γ((a, n)) = a.
Function γ may, of course, be generalized from symbols to sets of symbols, (sets of) traces,
and processes. For example

γ(T0) = SEM1(a, b)

and

γ(T1) = ({a, b}, {a, ab}*)

Process T is called data independent when

(∀t : t ∈ tT : γ(S(t, T)) = S(γ(t), γ(T)))

Set γ(S(t, T)) consists of all channels along which transmission can take place next. The
condition above expresses that this set is independent of the values transmitted thus far.
Processes T0, T3, and T4 are data independent and the other two are not. For example,

γ(S((a, 0), T1)) = γ({(a, 0), (a, 1)}) = {a}

but

S(γ((a, 0)), γ(T1)) = S(a, γ(T1)) = {a, b}
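The data independence condition can be tested on bounded approximations of T0 and T1. The sketch below rests on several assumptions that are not in the paper: symbols are (channel, value) tuples, traces are tuples of symbols, Z is cut down to {0, 1, 2} for T0, and only traces short enough for the approximation to be exact are inspected.

```python
def pref(traces):
    """Prefix closure of a set of traces (tuples of symbols)."""
    return {t[:i] for t in traces for i in range(len(t) + 1)}


def bounded_star(words, n):
    """Prefixes of all concatenations of at most n words."""
    traces = {()}
    frontier = {()}
    for _ in range(n):
        frontier = {t + w for t in frontier for w in words}
        traces |= frontier
    return pref(traces)


def gamma(t):
    """Replace every symbol (channel, value) of trace t by its channel."""
    return tuple(channel for channel, _ in t)


def successors(t, traces):
    """Symbols that may directly follow t within the given trace set."""
    return {u[-1] for u in traces if len(u) == len(t) + 1 and u[:-1] == t}


def data_independent(traces, depth):
    """Check gamma(S(t,T)) = S(gamma(t), gamma(T)) for traces up to length depth."""
    channel_traces = {gamma(t) for t in traces}
    for t in traces:
        if len(t) > depth:
            continue  # beyond this length the approximation is cut off
        lhs = {channel for channel, _ in successors(t, traces)}
        rhs = successors(gamma(t), channel_traces)
        if lhs != rhs:
            return False
    return True


# T0 doubles its input (values restricted to {0, 1, 2} for this sketch).
T0 = bounded_star([(("a", n), ("b", 2 * n)) for n in range(3)], 3)
# T1 filters out zeroes and passes on ones.
T1 = bounded_star([(("a", 0),), (("a", 1), ("b", 1))], 3)

t0_ok = data_independent(T0, 2)
t1_ok = data_independent(T1, 2)
```

With three rounds every trace of length at most 3 is present, so successor sets of traces of length at most 2 are exact; T1 already fails at the trace consisting of (a, 0).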

In  data  independent  processes  we  can  separate  the  communication  behavior  and  the 
computation  of  the  values.  (This  is  sometimes  called  ‘separating  data  and  control.’)  The 
communication behavior of process T is process γ(T). Often the communication behavior
is a rather simple process, but many properties, such as transparence, may already be
concluded from it.

We shall specify communication behaviors by regular expressions. For example, we
specify γ(T4) by the expression (a ; a ; b ; c)*. Notice that semicolons denote concatenation.
If the regular expression generates language X the process specified is (A, pref(X)), where
A is the set of all symbols that occur in the regular expression. This is a rather primitive
way of specifying processes, but it suffices for most of the examples in this paper.

Let A ⊆ γ(aT) and t ∈ tT. By t~A we mean t~(∪a : a ∈ A : {a} × V(a)). This
definition may be generalized from traces to processes. Then

γ(T~A) = γ(T)~A

Data independence is closed under projection on transparent alphabets:

Theorem 2.0 If T is data independent, γ(aT) is finite, and A ⊆ γ(aT) then

A transparent ⇒ T~A data independent

In order to maintain data independence under weaving we have to see to it that whenever
a communication along some channel a can take place:

a ∈ γ(S(t~aT, T)) ∩ γ(S(t~aU, U))

there is actually a transmissible value n:

{(a, n) | (t~aT)(a, n) ∈ tT} ∩ {(a, n) | (t~aU)(a, n) ∈ tU} ≠ ∅

This is expressed by the following theorem.

Theorem 2.1 Let T and U be data independent. Then

(∀t : t ∈ t(T w U) : γ(S(t~aT, T) ∩ S(t~aU, U))
                     = γ(S(t~aT, T)) ∩ γ(S(t~aU, U)))    (2.1)

if and only if

T w U data independent and γ(T w U) = γ(T) w γ(U)

In  order  to  allow  a  simple  check  that  (2.1)  holds,  we  partition  the  channels  of  each 
process  into  inputs,  outputs,  and  signals.  We  require 

(i) for each signal a, set V(a) = {0};

(ii) each input a (of process T) satisfies

(∀t, n : t ∈ tT ∧ a ∈ γ(S(t, T)) ∧ n ∈ V(a) : (a, n) ∈ S(t, T))

i.e., T does not restrict the values transmitted along its input channels. In processes T0
through T4 we can choose a to be an input and all other channels outputs. (We had this
choice in mind in the informal descriptions of these processes.) When weaving we see to
the observance of condition (2.1) by requiring that each symbol is an output of at most
one of the processes in the weave.

We conclude this section with a somewhat informal discussion of a simple example.
Its inclusion is meant to show how the computation of output values may be specified.

The example is a process to compute a cumulative sum. Its communication behavior is
specified by (a ; b)*. Channel a is an input, channel b is an output, and V(a) = V(b) = Z.
For t ∈ tT and 0 ≤ i < ℓ(t~a) we let a(i, t) denote the value of the ith transmission along
channel a in trace t:

a(i, t) = n  ≡  (∃u, v : t = u(a, n)v : ℓ(u~a) = i)

The values to be computed may then be specified by

b(i, t) = (Σj : 0 ≤ j ≤ i : a(j, t))

or, dropping the reference to trace t,

b(i) = (Σj : 0 ≤ j ≤ i : a(j))

for i ≥ 0. Consequently,

b(0) = a(0)    (2.2)

and, for i ≥ 0,

b(i + 1) = b(i) + a(i + 1)    (2.3)

We now describe how the output values are computed. To obtain a CSP-like [2]
notation we add variables x and y (and assignments) to the communication behavior
(a ; b)*. Our description of the computation is

y := 0 ; (a? x ; b!(y + x) ; y := y + x)*    (2.4)

The symbols of the communication behavior have been changed into communication statements:
as in CSP, each input is postfixed by a question mark and a variable, and each
output is postfixed by an exclamation point and an expression. The effect of b!(y + x) is
that (b, y + x) is added to the trace thus far generated, establishing

b(ℓ(t~b)) = y + x

Statement a? x, similarly, establishes

a(ℓ(t~a)) = x

Step 0 of the repetition in (2.4) establishes a(0) = x, b(0) = a(0), as required by
(2.2), and y = b(0). Consider, for i ≥ 0, step i + 1 of the repetition. We have initially
y = b(i). Statement a? x establishes a(i + 1) = x, statement b!(y + x) establishes b(i + 1) =
b(i) + a(i + 1), as required by (2.3), and y := y + x establishes y = b(i + 1).
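Program (2.4) translates almost literally into a generator. This is only a sketch: it models channel a as a Python iterable and channel b as the yielded values, which is an assumption of the illustration rather than something the paper prescribes.

```python
from itertools import accumulate


def cumulative_sum(inputs):
    """Simulate y := 0 ; (a? x ; b!(y + x) ; y := y + x)*."""
    y = 0
    for x in inputs:      # a? x
        yield y + x       # b!(y + x)
        y = y + x         # y := y + x


values = [3, 1, 4, 1, 5]
outputs = list(cumulative_sum(values))
reference = list(accumulate(values))   # b(i) = sum of a(0..i)
```

Each yielded value corresponds to one output b(i), so the list of outputs agrees with the running sums given by (2.2) and (2.3).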
3. Conservative Processes


Communication  behaviors  are  often  rather  simple  processes.  Checking  whether  alphabets 
are  transparent  is  then  not  difficult.  This  is  in  particular  the  case  if  the  communication 
behavior  is  a  conservative  process. 

The successor set S(t, T) consists of all symbols that may follow t. We now introduce
the after set of t in T, which consists of all traces that may follow t:

after(t, T) = {u ∈ aT* | tu ∈ tT}

for t ∈ tT. Process T is called conservative when

(∀t, a, b : a ≠ b ∧ ta ∈ tT ∧ tb ∈ tT
          : tab ∈ tT ∧ tba ∈ tT
            ∧ after(tab, T) = after(tba, T))

Conservatism  expresses,  informally  speaking,  that  different  events  do  not  disable  each  other 
and  that  the  order  in  which  enabled  events  occur  is  immaterial. 
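For a process with a finite trace set, the definition of conservatism can be evaluated directly. The sketch below assumes one-character symbols and string traces; the two sample processes (a small "diamond" and the four-symbol example process of Section 1) are chosen for illustration.

```python
def after(t, traces):
    """after(t, T): all traces u such that tu is a trace of T."""
    return {u[len(t):] for u in traces if u.startswith(t)}


def is_conservative(process):
    """Check the conservatism condition on a finite trace set."""
    alphabet, traces = process
    for t in traces:
        enabled = [s for s in alphabet if t + s in traces]
        for a in enabled:
            for b in enabled:
                if a == b:
                    continue
                # different enabled events must not disable each other ...
                if t + a + b not in traces or t + b + a not in traces:
                    return False
                # ... and their order must be immaterial for what follows
                if after(t + a + b, traces) != after(t + b + a, traces):
                    return False
    return True


diamond = ({"a", "b"}, {"", "a", "b", "ab", "ba"})
choice = ({"a", "b", "x", "y"}, {"", "x", "xa", "y", "yb"})

diamond_ok = is_conservative(diamond)
choice_ok = is_conservative(choice)
```

The second process fails because x and y are both enabled initially, yet neither xy nor yx is a trace: each event disables the other.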

We  have 

Theorem 3.0 For conservative processes T each A ⊆ aT is independent.

Conservatism is closed under projection and weaving, as the next two theorems express.

Theorem 3.1 T conservative ⇒ T~A conservative

Theorem 3.2 T and U conservative ⇒ T w U conservative

The following theorem can be of help to demonstrate conservatism for some simple
processes.

Theorem 3.3 Let R and S be regular expressions consisting of symbols separated by
semicolons. Then the process specified by R ; S* is conservative. Moreover, each subset of
its alphabet that contains a symbol occurring in S is transparent.

For example, the process specified by

c ; d ; (a ; a ; b ; c)*

is conservative and every non-empty subset of {a, b, c} is transparent.

A  process  may  contain  subprocesses.  For  reasons  of  simplicity  we  restrict  ourselves  in 
this  paper  to  processes  that  have  at  most  one  subprocess.  We  always  call  the  subprocess 
p. A subprocess has a type, which is again a process. If the subprocess has type U we let
p.U denote the process obtained from U by changing all symbols a into p.a, read 'p its a'.
For example, if

U = ({a, b}, {ε, a, ab})

then

p.U = ({p.a, p.b}, {ε, p.a, p.a p.b})

Let process T with aT = A be specified by a regular expression and let its subprocess
have type U. With S denoting the process specified by the regular expression, we require
aS = A ∪ a(p.U). Then, by definition,

T = (S w p.U)~A    (3.0)

For example, let p be of type SEM1(a, b) and let the regular expression be

b ; (a ; p.a ; b ; p.b)*    (3.1)

Then T = SEM1(b, a). However, if p were of type SEM1(b, a) we would obtain

T = ({a, b}, {ε, b, ba})

The  internal  symbols  in  S  represent  the  channels  along  which  communications  with  p 
occur.  Each  such  symbol  p.a  is  an  (internal)  input  or  output  of  S  if  the  corresponding 
symbol  a  is  an  output  or  input,  respectively,  of  p.  This  guarantees  condition  (2.1)  for  data 
independence  of  the  weave  in  (3.0)  to  hold.  Since  (3.0)  also  contains  a  projection,  we  have 
to  convince  ourselves  that  A  is  transparent  with  respect  to  S  w  p.U.  For  (3.1)  this  is 
guaranteed  by  Theorems  3.2  and  3.3. 

An interesting case occurs when T is recursive, i.e., when it has a subprocess of type
T. Then (3.0) becomes an equation in T:

T = (S w p.T)~A

By definition process T is the least solution of this equation, where 'least' is meant with
respect to the subset order for sets of traces. Phrased differently, process T is the least
fixpoint of function f defined by

f(X) = (S w p.X)~A    (3.2)

which equals the following least upper bound [4]:

(LUB i : i ≥ 0 : f^i((A, {ε})))

For example, if S is

(a ; p.a ; b ; p.b)*    (3.3)

or

a ; b ; (p.a ; a ; b ; p.b)*    (3.4)

we have T = SEM1(a, b). However, if S is

(a ; p.a ; p.b ; b)*

we find

T = ({a, b}, {ε, a})

Theorem 3.4 For conservative S the least fixpoint of f, as defined in (3.2), is conservative.

For  some  processes,  for  example,  those  specified  by  regular  expressions  conforming  to 
Theorem  3.3,  it  is  sensible  to  talk  about  the  duration  between  (external)  events,  or,  more 
precisely,  about  the  number  of  ordered  internal  events  between  successive  external  events. 
Let

T = (S w p.U)~A

A sequence function

σ : aS × N → N

(N the set of natural numbers) is a function satisfying

(∀t, a, b : tab ∈ tS ∧ a ∈ aS ∧ b ∈ aS
          : σ(a, ℓ(t~a)) < σ(b, ℓ(ta~b)))

We require that subprocess p has a corresponding sequence function σ', i.e., one that
satisfies

(∀a, i : a ∈ aU ∧ i ≥ 0 : σ(p.a, i) = σ'(a, i))

If σ is a sequence function then so is σ + m for all natural m.

The process specified by (3.3) has, for example, the following sequence function:

σ(a, i) = 4 * i
σ(b, i) = 4 * i + 2
σ(p.a, i) = 4 * i + 1    (3.5)
σ(p.b, i) = 4 * i + 3

This is an allowed sequence function, since σ(p.a, i) = σ(a, i) + 1, σ(p.b, i) = σ(b, i) + 1,
and σ + 1 is a sequence function for p.
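The ordering requirement on a sequence function is easy to test for a purely cyclic behavior such as (3.3). The sketch below walks along the single infinite trace of (a ; p.a ; b ; p.b)* for a bounded number of rounds and checks that consecutive events receive strictly increasing moments; the function name and the bound are illustrative assumptions.

```python
def respects_order(cycle, sigma, rounds):
    """Check sigma(a, l(t~a)) < sigma(b, l(ta~b)) for consecutive events a, b."""
    counts = {s: 0 for s in cycle}   # occurrences of each symbol so far
    previous = None
    for _ in range(rounds):
        for s in cycle:
            moment = sigma[s](counts[s])
            if previous is not None and not previous < moment:
                return False
            counts[s] += 1
            previous = moment
    return True


cycle = ["a", "p.a", "b", "p.b"]      # one iteration of (3.3)
sigma = {
    "a": lambda i: 4 * i,
    "b": lambda i: 4 * i + 2,
    "p.a": lambda i: 4 * i + 1,
    "p.b": lambda i: 4 * i + 3,
}
# A deliberately wrong variant: b is scheduled at the same moment as a.
broken = dict(sigma, b=lambda i: 4 * i)

valid = respects_order(cycle, sigma, 50)
invalid = respects_order(cycle, broken, 50)
```

A bounded walk cannot prove the property for all traces, but for (3.5) the moments along the trace are simply 0, 1, 2, 3, ..., which makes the general argument evident.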

We say that process T has constant response time when there exists a sequence function
σ for T such that

(∃n : n ≥ 1 : (∀t, a, b : tab ∈ tT ∧ a ∈ aT ∧ b ∈ aT
              : σ(b, ℓ(ta~b)) - σ(a, ℓ(t~a)) ≤ n))

The process specified by (3.3) has constant response time: for σ as given in (3.5) the
condition above holds for n = 2. The process specified by (3.4) does not have constant
response time. A possible sequence function for that process is

σ(a, i) = (i + 1)² - 1
σ(b, i) = (i + 1)²
σ(p.a, i) = (i + 2)² - 2
σ(p.b, i) = (i + 2)² + 1

We  have  now  assembled  all  the  theory  we  need  to  discuss  a  number  of  interesting 
examples.  These  are  presented  in  the  next  three  sections. 


4.  Polynomial  Multiplication 

Given is a polynomial q of degree M, M ≥ 0. For 0 ≤ i ≤ M we let q_i denote the
coefficient of x^i in q:

q = q_M * x^M + ... + q_1 * x + q_0

Polynomial q has to be multiplied by a polynomial r of degree N, N ≥ 0, yielding a
polynomial s of degree M + N, given by

s_{M+N-i} = (Σj : max(i - M, 0) ≤ j ≤ min(i, N) : q_{M+j-i} * r_{N-j})


The process that carries out the multiplication is to have input a and output b with
V(a) = V(b) = Z. Along input a the coefficients r_i are transmitted in order of decreasing
indices, followed by zeroes:

a(i) = r_{N-i}   if 0 ≤ i ≤ N
a(i) = 0         if i > N          (4.0)

Along output b the coefficients of s are to be transmitted, followed by zeroes:

b(i) = s_{M+N-i}   if 0 ≤ i ≤ M + N
b(i) = 0           if i > M + N

The communication behavior is (a ; b)*. In view of (4.0) we have for 0 ≤ i ≤ M + N

b(i) = (Σj : max(i - M, 0) ≤ j ≤ min(i, N) : q_{M+j-i} * a(j))    (4.1)

We design for 0 ≤ k ≤ M processes MUL_k, that have (external) communication
behavior (a ; b)* and, cf. (4.1),

b(i) = (Σj : max(i - k, 0) ≤ j ≤ min(i, N) : q_{k+j-i} * a(j))    (4.2)

if 0 ≤ i ≤ k + N, and b(i) = 0 if i > k + N. Then MUL_M is the process we are interested
in.

Process MUL_0 is simple: (4.2) yields for k = 0

b(i) = q_0 * a(i)   if 0 ≤ i ≤ N
b(i) = 0            if i > N

Since a(i) = 0 for i > N, this may be simplified to

b(i) = q_0 * a(i)

for i ≥ 0. The computation of MUL_0 may be specified by

(a? x ; b!(q_0 * x))*

We now turn to MUL_k for 1 ≤ k ≤ M. It has a subprocess of type MUL_{k-1}. Consequently,

p.a(i) = a(i) for i ≥ 0    (4.3)

p.b(i) = (Σj : max(i - k + 1, 0) ≤ j ≤ min(i, N) : q_{k-1+j-i} * a(j))   if 0 ≤ i < k + N    (4.4)
p.b(i) = 0                                                               if i ≥ k + N        (4.5)

By (4.2)

b(0) = q_k * a(0)    (4.6)

and for 0 ≤ i < k + N

b(i + 1) = (Σj : max(i - k + 1, 0) ≤ j ≤ min(i + 1, N) : q_{k+j-i-1} * a(j))

Hence, by (4.4),

b(i + 1) = p.b(i)                      if N ≤ i < k + N
b(i + 1) = p.b(i) + q_k * a(i + 1)     if 0 ≤ i < N

Since a(i + 1) = 0 for i ≥ N, this may be simplified to

b(i + 1) = p.b(i) + q_k * a(i + 1)

for 0 ≤ i < k + N. For i ≥ k + N we have p.b(i) = 0 and a(i + 1) = 0. Consequently,

b(i + 1) = p.b(i) + q_k * a(i + 1)    (4.7)

for i ≥ 0.

Fig. 1. A process MUL_k (drawing not reproduced)

Relations  (4.3),  (4.6),  and  (4.7)  tell  us  how  the  output  values  may  be  computed.  We 
choose 


(a ; p.a ; b ; p.b)*    (4.8)

as the communication behavior. Then p.a(i) follows a(i), as required by (4.3), b(0) follows
a(0), as required by (4.6), and b(i + 1) follows p.b(i) and a(i + 1), as required by (4.7).
According to Theorem 3.3 alphabet {a, b} is transparent. We have already shown that
the process has constant response time and that, with S denoting the process specified by
(4.8),

(S w p.SEM1(a, b))~{a, b} = SEM1(a, b)

Given (4.3), (4.6), and (4.7), it is now simple to specify the computation of the output
values:

y := 0 ; (a? x ; p.a! x ; b!(y + q_k * x) ; p.b? y)*    (4.9)

Process MUL_k consists of the process specified by (4.9), which uses value q_k, and
MUL_{k-1} as a subprocess. Figure 1 shows an instance of process MUL_k, in which each
process that uses value q_k is drawn as a rectangle with q_k in it.
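The array defined by (4.9) can be simulated sequentially. The sketch below is an illustration, not the paper's notation: each cell is an object whose step method performs one round a? x ; p.a! x ; b!(y + q_k * x) ; p.b? y, with the communication with subprocess p modeled as a nested method call. The class name Mul and the helper multiply are hypothetical.

```python
class Mul:
    """Cell k of MUL_k; cell 0 has no subprocess."""

    def __init__(self, q, k):
        self.qk = q[k]                                # coefficient stored in this cell
        self.sub = Mul(q, k - 1) if k > 0 else None   # subprocess p of type MUL_{k-1}
        self.y = 0                                    # y := 0

    def step(self, x):
        """One round of (4.9); returns the value sent along b."""
        out = self.y + self.qk * x       # b!(y + q_k * x), using the old y
        if self.sub is not None:
            self.y = self.sub.step(x)    # p.a! x ... p.b? y
        return out


def multiply(q, r):
    """Multiply polynomials given as coefficient lists, lowest degree first."""
    m = len(q) - 1
    array = Mul(q, m)                                # process MUL_M
    stream = list(reversed(r)) + [0] * m             # r_N ... r_0 followed by M zeroes
    outputs = [array.step(x) for x in stream]        # s_{M+N} ... s_0
    return list(reversed(outputs))


# (x + 2) * (3x + 4) = 3x^2 + 10x + 8
product_small = multiply([2, 1], [4, 3])
product_large = multiply([1, 2, 3], [4, 5, 6, 7])
```

The nested call fixes one particular interleaving of the cells, which is harmless here because the values computed do not depend on the order in which independent cells proceed.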
We have designed an array of M + 1 cells. Each cell stores one coefficient of polynomial
q. All cells are equal, except for the last one, which has no right neighbor. (We could have
made this one equal by adding a cell at the end that returns value 0 upon every input.)

The coefficients of polynomials r and s are transmitted in order of decreasing indices.
This order is actually immaterial. We could have done the same analysis for the reverse
order, and the only change would have been to replace q_k in process MUL_k by q_{M-k}: the
order in which the coefficients of q are distributed over the cells is reversed as well.

Our solution is independent of the degree of r. Process MUL_M will multiply polynomial
q by polynomials of any degree. In order for the complete product to be produced at b,
we have to require of the input only that the coefficients of r are followed by at least M
zeroes. But afterwards we have y = 0 and a new polynomial may be input again! We have
thus designed a systolic computation for repeatedly multiplying a fixed polynomial q by
other polynomials. The only restriction on the input is that the coefficients of different
polynomials are separated by (at least) M zeroes.


5.  Cyclic  Encoding 

A  nice  application  of  polynomial  multiplication  is  the  encoding  of  messages,  using  a 
cyclic code. Given is a polynomial q of degree M with M ≥ 1, q_i ∈ {0, 1}, and q_M = 1.
Polynomial q is often called the generator polynomial of the cyclic code. Each message
is a sequence r_N r_{N-1} ... r_0, where r_i ∈ {0, 1} and N ≥ 0. The message may be regarded
as the coefficients of a polynomial r. The encoded message consists of the coefficients of
polynomial

r * x^M ⊕ t    (5.0)

of degree M + N, where t is the remainder polynomial after division of r * x^M by q and ⊕
denotes addition modulo 2. More precisely, polynomial t is defined by

r * x^M = q * d ⊕ t    (5.1)

where d is a polynomial of degree N and t has degree M - 1.

For example, if the generator polynomial is x^4 + x + 1 (M = 4) and the message is
101110111, i.e., r = x^8 + x^6 + x^5 + x^4 + x^2 + x + 1, we find for (5.1)

x^12 + x^10 + x^9 + x^8 + x^6 + x^5 + x^4
    = (x^4 + x + 1) * (x^8 + x^6 + x^3 + x) ⊕ (x^3 + x^2 + x)

The encoded message is, by (5.0), 1011101111110: the sequence 1110 of M check bits,
corresponding to polynomial x^3 + x^2 + x, has been added to the message. A well-known
example is the use of x + 1 as generator polynomial. It results in adding a parity bit.
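The worked example can be reproduced with ordinary long division modulo 2. The sketch below is a direct, non-systolic reference computation and assumes polynomials are encoded as Python integers in which bit i holds the coefficient of x^i; the function names are illustrative.

```python
def gf2_divmod(dividend, divisor):
    """Polynomial division modulo 2; returns (quotient, remainder)."""
    quotient = 0
    degree = divisor.bit_length() - 1
    while dividend.bit_length() - 1 >= degree:
        shift = dividend.bit_length() - 1 - degree
        quotient ^= 1 << shift          # record the term x^shift of d
        dividend ^= divisor << shift    # subtract (= add) q * x^shift
    return quotient, dividend


def encode(message, generator):
    """Return r * x^M xor t, cf. (5.0), for message r and generator q."""
    m = generator.bit_length() - 1      # M, the degree of q
    shifted = message << m              # r * x^M
    _, t = gf2_divmod(shifted, generator)
    return shifted ^ t


q = 0b10011                 # x^4 + x + 1
r = 0b101110111             # the message 101110111
d, t = gf2_divmod(r << 4, q)
codeword = encode(r, q)
with_parity = encode(0b100, 0b11)   # generator x + 1 appends a parity bit
```

Here d and t are the quotient and remainder of (5.1); the codeword is the message followed by the M check bits.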
On account of (5.1), polynomial (5.0), representing the encoded message, equals q * d.
Our encoder, consequently, has to multiply polynomials q and d, a problem we have already
solved in Section 4. Polynomial d is, of course, not given, but the amazing property,
pointed out to me by F. W. Sijstermans of Philips Research, is that the coefficients of d
may be determined as polynomial r is input.

By (5.1) we conclude

d_N = r_N    (5.2)

Notice that for 0 ≤ j ≤ N

(q * d)_{M+j} = (r * x^M ⊕ t)_{M+j} = (r * x^M)_{M+j} = r_j    (5.3)

Let q' be a polynomial of degree M - 1 such that

q = x^M ⊕ q'    (5.4)

Then, for 0 ≤ j < N,

d_j = (x^M * d)_{M+j}
    = {by (5.4)} ((q ⊕ q') * d)_{M+j}
    = (q * d)_{M+j} ⊕ (q' * d)_{M+j}
    = {by (5.3)} r_j ⊕ (q' * d)_{M+j}    (5.5)

We introduce a subcomponent p of type MUL_{M-1} that computes q' * d. The output
of p can be used to determine both d_j for 0 ≤ j < N and, as will turn out later, b(i) for
N < i ≤ N + M.

Our process has input a and output b. For 0 ≤ i ≤ N

a(i) = r_{N-i}    (5.6)

and for 0 ≤ i ≤ M + N

b(i) = (q * d)_{M+N-i}

The external communication behavior is

(a ; b)^{N+1} ; b^M

where S^N denotes N concatenations of S, for example

(a ; b)^2 = (a ; b ; a ; b)

We suggest inserting the internal communications as follows:

(a ; p.a ; b ; p.b)^{N+1} ; (p.a ; b ; p.b)^M

We have, according to the specification of MUL_{M-1},

p.a(i) = d_{N-i}   if 0 ≤ i ≤ N
p.a(i) = 0         if N < i ≤ M + N          (5.7)

and

p.b(i) = (q' * d)_{M+N-i-1}   if 0 ≤ i < M + N
p.b(i) = 0                    if i = M + N   (5.8)

For 0 ≤ i ≤ N we have, by (5.3),

b(i) = (q * d)_{M+N-i} = r_{N-i} = a(i)

For N ≤ i < M + N we find

b(i + 1) = (q * d)_{M+N-i-1}
         = {by (5.4)} ((x^M ⊕ q') * d)_{M+N-i-1}
         = (x^M * d)_{M+N-i-1} ⊕ (q' * d)_{M+N-i-1}
         = {M + N - i - 1 ≤ M - 1} (q' * d)_{M+N-i-1}
         = {by (5.8)} p.b(i)

Outputs p.a(i) may be determined as follows.

p.a(0) = {by (5.7)} d_N
       = {by (5.2)} r_N
       = {by (5.6)} a(0)

For 0 ≤ i < N we derive

p.a(i + 1) = {by (5.7)} d_{N-i-1}
           = {by (5.5)} r_{N-i-1} ⊕ (q' * d)_{M+N-i-1}
           = {by (5.6) and (5.8)} a(i + 1) ⊕ p.b(i)

Furthermore, p.a(i) = 0 for N < i ≤ M + N.
Summarizing, we have

b(i) = a(i)         if 0 ≤ i ≤ N
b(i) = p.b(i - 1)   if N < i ≤ M + N

and

p.a(i) = a(i)                 if i = 0
p.a(i) = a(i) ⊕ p.b(i - 1)    if 1 ≤ i ≤ N
p.a(i) = 0                    if N < i ≤ M + N

The computation of these output values may be specified as follows.

y := 0
; (a? x ; p.a!(x ⊕ y) ; b! x ; p.b? y)^{N+1}
; (p.a! 0 ; b! y ; p.b? y)^M

Since p.b(M + N) = 0, the last statement reestablishes y = 0. We have, therefore, a
simple way of changing the program into one that repeatedly encodes messages:

y := 0
; ( (a? x ; p.a!(x ⊕ y) ; b! x ; p.b? y)^{N+1}
  ; (p.a! 0 ; b! y ; p.b? y)^M
  )*

Making the process independent of the message length requires a message-separator
signal. Calling that signal c, our solution becomes

y := 0
; ( (a? x ; p.a!(x ⊕ y) ; b! x ; p.b? y)*
  ; c ; (p.a! 0 ; b! y ; p.b? y)^M
  )*

By adding one cell to MUL_{M-1} (or one could say, by changing the first cell of MUL_M) we
have obtained a systolic computation for encoding messages of varying lengths. Figure 2
shows a drawing of the process.

The communication behavior at the source side is (a* ; c)*: the messages are separated
by signal c, and the value of M is immaterial to the source side transmissions.
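The encoder with separator signal can be simulated in the same sequential style. The sketch below makes several modeling assumptions: the subprocess of type MUL_{M-1} is a chain of objects working modulo 2, one call of encode handles one message, and the end of the message list plays the role of signal c. The class names MulGF2 and Encoder are hypothetical.

```python
class MulGF2:
    """Cell k of MUL_k with arithmetic modulo 2."""

    def __init__(self, q, k):
        self.qk = q[k]
        self.sub = MulGF2(q, k - 1) if k > 0 else None
        self.y = 0

    def step(self, x):
        out = self.y ^ (self.qk & x)     # b!(y xor q_k * x)
        if self.sub is not None:
            self.y = self.sub.step(x)    # p.a! x ... p.b? y
        return out


class Encoder:
    """First cell of the cyclic encoder, with subprocess p computing q' * d."""

    def __init__(self, generator):
        # generator lists q_0 ... q_M with q_M = 1 and M >= 1
        self.m = len(generator) - 1
        self.p = MulGF2(generator[:-1], self.m - 1)   # q' is q without x^M
        self.y = 0                                    # y := 0

    def encode(self, message):
        out = []
        for x in message:                     # (a? x ; p.a!(x xor y) ; b! x ; p.b? y)*
            out.append(x)
            self.y = self.p.step(x ^ self.y)
        for _ in range(self.m):               # c ; (p.a! 0 ; b! y ; p.b? y)^M
            out.append(self.y)
            self.y = self.p.step(0)
        return out


encoder = Encoder([1, 1, 0, 0, 1])            # x^4 + x + 1
first = encoder.encode([1, 0, 1, 1, 1, 0, 1, 1, 1])
state_after_first = encoder.y
second = encoder.encode([1])                  # a message of different length
```

The same encoder object is reused for the second message, which relies on y = 0 being reestablished after the M trailing rounds.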
Fig. 2. Cyclic encoder


6.  Palindrome  Recognition 

In this section we discuss a recursive palindrome recognizer. The object is to specify
a process with external behavior (b ; a)*, where b is output and a is input, V(b) =
{true, false}, and V(a) = Z. The value of output b has to indicate whether the sequence
thus far received at input a is a palindrome. More precisely, for i ≥ 0

b(i) = (∀j : 0 ≤ j < i : a(j) = a(i - 1 - j))

We  have 

b(0) = b(1) = true                                          (6.0)

and for i ≥ 0

b(i + 2) = (∀j : 0 ≤ j < i + 2 : a(j) = a(i + 1 - j))

         = (a(0) = a(i + 1))
           ∧ (∀j : 1 ≤ j < i + 1 : a(j) = a(i + 1 - j))
           ∧ (a(i + 1) = a(0))

         = (a(0) = a(i + 1))
           ∧ (∀j : 0 ≤ j < i : a(j + 1) = a(i - j))           (6.1)

The latter conjunct is again the outcome of a palindrome recognizer, but now one that
pertains to the input sequence beginning at a(1). We, therefore, introduce a subprocess of
the same type as the process we are designing: for i ≥ 0

p.a(i) = a(i + 1)                                           (6.2)


and 


p.b(i) = (∀j : 0 ≤ j < i : p.a(j) = p.a(i - 1 - j))

Using (6.2), the latter relation may be written as

p.b(i) = (∀j : 0 ≤ j < i : a(j + 1) = a(i - j))

By (6.1) we then find for i ≥ 0

b(i + 2) = (a(0) = a(i + 1)) ∧ p.b(i)                       (6.3)
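
As a sanity check on this derivation, the following minimal sketch compares the direct definition of b(i) with the recurrence given by (6.0), (6.2), and (6.3). The function names are invented for this illustration, and the input sequence is represented as a Python list.

```python
from itertools import product


def b_direct(a, i):
    """b(i) by definition: a(j) = a(i - 1 - j) for all 0 <= j < i."""
    return all(a[j] == a[i - 1 - j] for j in range(i))


def b_recursive(a, i):
    """b(i) via (6.0) and (6.3); the subprocess sees p.a(j) = a(j + 1), cf. (6.2)."""
    if i < 2:
        return True                      # (6.0): b(0) = b(1) = true
    p_a = a[1:]                          # (6.2): input stream of the subprocess
    return a[0] == a[i - 1] and b_recursive(p_a, i - 2)   # (6.3)


# Exhaustive comparison over all binary sequences of length 6.
agree = all(
    b_direct(list(a), i) == b_recursive(list(a), i)
    for a in product(range(2), repeat=6)
    for i in range(7)
)
```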

Since  the  first  two  outputs  at  b  are  computed  differently  from  the  subsequent  ones,  we 
suggest 

b ; a ; b ; (a ; p.b ; b ; p.a)*                            (6.4)

as the communication behavior. Then p.a(i) follows a(i + 1), cf. (6.2), and b(i + 2) follows
a(i + 1) and p.b(i), as required by (6.3). By Theorem 3.3 alphabet {a, b} is transparent.
With S denoting the process specified by (6.4) we have

(S w p.S)↾{a, b} = SEM1(b, a)

This  shows  that  the  process  has  the  required  external  behavior.  It  also  has  constant 
response  time.  A  sequence  function  is 

σ(a, i) = 4 * i + 2
σ(b, i) = 4 * i
σ(p.a, i) = 4 * i + 9
σ(p.b, i) = 4 * i + 7

This is an allowed sequence function, since σ + 7 is a sequence function for p and

σ(p.a, i) = σ(a, i) + 7
σ(p.b, i) = σ(b, i) + 7
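
The two properties used here can be checked mechanically. The sketch below enumerates the events of behavior (6.4) in order and confirms that the moments assigned by the sequence function increase along that order, and that the subprocess events are those of the cell itself shifted by 7. The helper names `sigma` and `behavior` are assumptions made for this illustration.

```python
from itertools import islice


def sigma(channel, i):
    """Sequence function from the text: moment of the i-th event on a channel."""
    offsets = {"b": 0, "a": 2, "p.b": 7, "p.a": 9}
    return 4 * i + offsets[channel]


def behavior():
    """Yield (channel, index) events in the order b ; a ; b ; (a ; p.b ; b ; p.a)*."""
    yield ("b", 0)
    yield ("a", 0)
    yield ("b", 1)
    i = 0
    while True:
        yield ("a", i + 1)
        yield ("p.b", i)
        yield ("b", i + 2)
        yield ("p.a", i)
        i += 1


events = list(islice(behavior(), 43))
moments = [sigma(ch, i) for ch, i in events]
```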

Given  (6.0),  (6.2),  and  (6.3),  it  is  not  difficult  to  extend  (6.4)  with  the  computation  of 
the  output  values: 

b! true ; a? z ; b! true
; (a? x ; p.b? y ; b! ((z = x) ∧ y) ; p.a! x)*
(Notice  that  a?  z  establishes  a(0)  =  z.) 
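
The cell program can be imitated sequentially with Python generators, as a sketch rather than a faithful concurrent implementation. Each `yield` plays the role of a b! output followed by the next a? input, and `p.send(x)` combines p.a! x with the following p.b? y, so the interleaving prescribed by (6.4) is flattened into one thread. The names `cell` and `recognize` are invented for this illustration.

```python
def cell():
    """One cell: b!true ; a?z ; b!true ; (a?x ; p.b?y ; b!((z = x) and y) ; p.a!x)*."""
    z = yield True            # b!true ; a?z   (z holds a(0))
    x = yield True            # b!true ; a?x   (first a? of the repetition)
    p = cell()                # subprocess of the same type, created on demand
    y = next(p)               # p.b?y, the subprocess's first output
    while True:
        x_next = yield (z == x) and y   # b!((z = x) and y), then the next a?
        y = p.send(x)                   # p.a!x, then the next p.b?y
        x = x_next


def recognize(values):
    """Feed values to cell 0 and return the outputs b(0), ..., b(len(values))."""
    c = cell()
    outputs = [next(c)]
    for v in values:
        outputs.append(c.send(v))
    return outputs


answers = recognize([1, 2, 1])
```

Because every `send` descends through all cells created so far, the call depth grows with roughly half the input length; this sketch is therefore meant for short sequences only.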

We have designed an infinite array of cells. With the sequence function given above,
subprocess p starts at moment σ(p.b, 0), i.e., at moment 7. In general, cell k, for k ≥ 0,
starts at moment 7 * k. Cell 0 produces the answers. For i ≥ 0 answer b(i) is produced
at moment σ(b, i) = 4 * i. At that moment all cells k for which 7 * k ≤ 4 * i have started:
slightly over i/2 cells are required to produce answer b(i).
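
The cell count can be made concrete with a few lines of arithmetic. The sketch assumes that a cell counts as started when its start moment 7 * k does not exceed the moment 4 * i at which b(i) is produced; the function name is invented for this illustration.

```python
def cells_started(i):
    """Number of cells k >= 0 with 7 * k <= 4 * i, i.e. started when b(i) is produced."""
    return sum(1 for k in range(i + 1) if 7 * k <= 4 * i)


# The count equals floor(4 * i / 7) + 1, which is a little more than i / 2.
counts = [cells_started(i) for i in (0, 7, 100)]
```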

7. Conclusion


Systolic  arrays  are  often  presented  and  explained  by  means  of  pictures.  We  have 
refrained  from  doing  so.  Of  course,  we  showed  a  few  pictures,  but  they  were  merely  used 
as  illustrations:  in  no  way  did  our  discussion  rely  on  them. 

We  discussed  systolic  computations  in  terms  of  their  input/output  behaviors.  This  is 
a  method  for  which  the  formalism  of  trace  theory  is  very  well-suited.  We  have  isolated  the 
concepts  of  data  independence,  transparence,  and  conservatism  as  central  notions  in  the 
study  of  systolic  computations.  We  are  pleased  with  the  nice  way  in  which  these  concepts 
tie  together.  In  contrast  to  what  is  customary,  we  did  not  describe  the  computations  in 
terms  of  global  states.  As  a  matter  of  fact,  we  suspect  that  these  solutions  would  not 
have  been  found  then:  in  [9]  the  palindrome  recognizer  requires  cells  that  are  slightly  more 
complicated  (essentially,  the  combination  of  two  of  ours)  to  achieve  that  communication 
takes  place  with  the  neighbor  cells  only. 

One  of  the  reasons  why  we  want  each  cell  to  have  a  fixed  number  of  neighbor  cells  is 
to  facilitate  the  realization  of  our  computations  as  VLSI  circuits.  The  main  reason  to  have 
all  synchronization  be  accomplished  by  message  passing  is  that  we  would  like  these  VLSI 
circuits  to  be  delay-insensitive  [10],  which  excludes  the  use  of  global  clocks.  The  work 
by  Alain  J.  Martin  on  compiling  CSP-like  programs  into  delay-insensitive  VLSI  circuits 
[6]  shows  that  such  realizations  may  be  obtained  by  introducing  handshaking  protocols  to 
implement  the  communication  actions. 